Zipf's law on word frequency is observed in English, French, Spanish, Italian, and so on, yet it
does not hold for Chinese, Japanese or Korean characters. A model for the writing process is
proposed to explain this difference, which takes into account the effects of finite vocabulary
size. Experiments, simulations and the analytical solution agree well with each other. The results
show that the frequency distribution follows a power law with exponent equal to 1, at which the
corresponding Zipf's exponent diverges. Actually, the distribution obeys an exponential form in the
Zipf's plot. Deviating from the Heaps' law, the number of distinct words grows with the text length
in three stages: it grows linearly in the beginning, then turns to a logarithmic form, and
eventually saturates. This work refines previous understanding about Zipf's law and Heaps' law in
language systems.

PACS numbers: 89.20.Hh, 89.75.Hc
Uncovering the statistics and dynamics of human language helps in characterizing the
universality, specificity and evolution of cultures [1-11]. Two scaling relations, Zipf's law [12]
and Heaps' law [13], have attracted much attention from the academic community. Denoting by $r$ the
rank of a word according to its frequency $Z(r)$, Zipf's law is the relation
$Z(r) \sim r^{-\alpha}$, with $\alpha$ being the Zipf's exponent. Zipf's law has been observed in
many human languages, including English, French, Spanish, Italian, and so on [IH HH EH]. Heaps' law
is formulated as $N_t \sim t^{\lambda}$, where $N_t$ is the number of distinct words when the text
length is $t$, and $\lambda < 1$ is the so-called Heaps' exponent. These two laws coexist in many
language systems. Gelbukh and Sidorov [16] observed these two laws in English, Russian and Spanish
texts, with different exponents depending on the language. Similar results were recently reported
for corpora of web texts [13], including the Industry Sector database, the Open Directory and the
English Wikipedia. The occurrences of tags for online resources [18, 19], keywords for scientific
publications [20] and words contained in web pages returned by web searches [21] also
simultaneously display the Zipf's law and Heaps' law. Interestingly, even the identifiers in
programs written in Java, C++ and C exhibit the same scaling laws [22].

The Zipf's law in language systems could result from a rich-get-richer mechanism as suggested by
the Yule-Simon model [23, 24], where a new word is added to a text with probability $q$ and a word
that has already appeared is randomly chosen and copied with probability $1 - q$. A word that
appears more frequently thus has a higher probability of being copied, leading to a power-law word
frequency distribution $p(k) \sim k^{-\beta}$ with $\beta = 1 + 1/(1 - q)$. Dorogovtsev and Mendes
modeled language processing as the evolution of a word web with preferential attachment [25].



TABLE I: The basic statistics of the four books. $\beta$ is the exponent of the power-law
frequency distribution and $N_t$ is the total number of distinct characters.

Book                            t         N_t    k_max   k_min   β
The Story of the Stone          727601    4239   21054   1       1.09
The Battle Wizard               1020336   4178   20028   1       1.03
Into the White Night            420935    2182   18992   1       1.00
History of the Three Kingdoms   157201    1139   5929    1       1.07


Zanette and Montemurro [26] as well as Cattuto et al. [27] accounted for memory effects, namely
that recently used words have a higher probability of being chosen than words that occurred long
ago. These works can be considered as variants of the Yule-Simon model. Meanwhile, the Heaps' law
may originate from the memory and bursty nature of human language [28-30].
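The copying rule of the Yule-Simon model described above can be sketched in a few lines. This is a minimal illustration, not the code used in the cited works: the function name `yule_simon_text`, the integer word ids and the parameter values are assumptions made here. Copying a uniformly random earlier token is equivalent to choosing an existing word with probability proportional to its frequency.

```python
import random
from collections import Counter


def yule_simon_text(length, q, seed=0):
    """Generate a token sequence with the Yule-Simon copying rule.

    With probability q a brand-new word (a fresh integer id) is appended;
    otherwise a uniformly random earlier token is copied, which selects
    each existing word with probability proportional to its frequency.
    `length` is assumed to be at least 1.
    """
    rng = random.Random(seed)
    text = [0]          # start from a single word so that copying is possible
    next_word = 1
    while len(text) < length:
        if rng.random() < q:
            text.append(next_word)
            next_word += 1
        else:
            text.append(rng.choice(text))
    return text


if __name__ == "__main__":
    text = yule_simon_text(length=100_000, q=0.1)
    freq = Counter(text)
    print("distinct words:", len(freq))
    print("five largest frequencies:", sorted(freq.values(), reverse=True)[:5])
```

Because new words keep arriving at a constant rate $q$, the vocabulary in this sketch is unbounded and the number of distinct words grows linearly with the text length.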

Real language systems to some extent deviate from these two scaling laws and display more
complicated statistical regularities. Wang et al. [31] analyzed representative publications in
Chinese, and showed that the character frequency distribution exhibits an exponential feature.
Lü et al. [32] pointed out that in a growing system, if the appearing frequencies of elements obey
the Zipf's law with a stable exponent, then the number of distinct elements grows in a complicated
way, with the Heaps' law being only an asymptotic approximation. This deviation from the Heaps' law
was further emphasized and mathematically proved by Eliazar [33]. Empirical analyses on real
language systems showed similar deviations [34]. Via extensive analysis of individual Chinese,
Japanese and Korean books, as well as a collection of more than $5 \times 10^4$ Chinese books, we
found even more complicated phenomena: (i) the character frequency distribution follows a power
law yet it decays exponentially in the Zipf's plot; (ii) as the text length increases, the number
of distinct characters grows in three different stages: linear, logarithmic and saturated. All
these unreported regularities may result from the finite vocabulary size, which is further
verified by a simple theoretical model.

FIG. 1: (Color online) The character frequency distribution of The Story of the Stone: (a1) $p(k)$
in log-log scale and (a2) $Z(r)$ in log-linear scale. The number of distinct words versus the text
length of The Story of the Stone in (a3) log-log scale and (a4) linear-log scale. Similar plots in
(b1-b4), (c1-c4) and (d1-d4) are for the books The Battle Wizard, Into the White Night and The
History of the Three Kingdoms, respectively. The power-law exponent $\beta$ is obtained by
maximum likelihood estimation [35, 36], while the exponent in the Zipf's plot is obtained by the
least-squares method excluding the head (i.e., $r > 500$ for Chinese books and $r > 200$ for
Japanese and Korean books).

We first show some experimental results on the statistical regularities of Chinese, Japanese and
Korean literature, which are typical examples generated from a vocabulary of very limited size if
we look at the character level. There are in total more than $9 \times 10^4$ Chinese characters,
yet only 3000 to 4000 of them are used frequently (Taiwan and Hong Kong respectively identify 4808
and 4759 frequently used characters, while mainland China has two versions of the list of
frequently used characters, one containing 2500 characters and the other containing 3500
characters), and the numbers of Japanese and Korean characters are even smaller. We start with four
famous books; the first two are in Chinese, the third is in Japanese and the last is in Korean:
(i) The Story of the Stone (aliases: The Dream of the Red Chamber, A Dream of Red Mansions and Hong
Lou Meng), written by Xueqin Cao in the mid-eighteenth century during the reign of Emperor
Chien-lung of the Qing Dynasty; (ii) The Battle Wizard (aliases: Tian Long Ba Bu and Demi-Gods and
Semi-Devils), a kung fu novel written by Yong Jin; (iii) Into the White Night, a modern novel
written by Higashino Keigo; (iv) The History of the Three Kingdoms, a very famous history book
written by Shou Chen in China and later translated into Korean. These books cover disparate topics
and genres and were completed at widely different dates. The basic statistics of these books are
presented in Table I.

FIG. 2: (Color online) The growth of distinct characters in the collection of 57755 Chinese books.
The result is obtained by averaging over 100 randomly determined orders of these books.

Figure 1 reports the character frequency distribution $p(k)$, the Zipf's plot of character
frequency $Z(r)$ and the growth of the number of distinct characters $N_t$ versus the total number
of characters that have appeared in the text. As shown in figure 1, the character frequency
distributions are power laws, while the frequency decays exponentially in the Zipf's plot, which
conflicts with the common belief that a power-law probability density function always corresponds
to a power-law decay in the Zipf's plot. Actually, the two exponents $\alpha$ and $\beta$ are
related as $\alpha = 1/(\beta - 1)$ [HH], and thus when $\beta$ gets close to 1, the exponent
$\alpha$ diverges and the decay in the Zipf's plot cannot be well characterized by a power law.
Therefore, if we observe a non-power-law decay in the Zipf's plot, we cannot immediately deduce
that the corresponding probability density function is not a power law: it is possibly a power law
with exponent close to 1. Note that, in the Zipf's plots, the turned-up head contains a few hundred
characters, the majority of which play a role similar to the auxiliary words, conjunctions or
prepositions in English.
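The link between the two exponents can be made explicit with a short calculation. The following is a sketch under the assumption of a continuous frequency variable; the number of distinct items $N$, the largest frequency $k_{\max}$ and the normalization constant $C$ are notation introduced here for illustration only.

```latex
% Rank of an item with frequency k = number of items with frequency >= k
r(k) \simeq N \int_{k}^{k_{\max}} C\,k'^{-\beta}\,dk'
     = \frac{N C}{\beta - 1}\left(k^{1-\beta} - k_{\max}^{1-\beta}\right)
     \qquad (\beta > 1).

% For k << k_max the second term is negligible, so
r \propto k^{-(\beta - 1)}
\;\Longrightarrow\;
Z(r) = k \propto r^{-1/(\beta - 1)},
\qquad \alpha = \frac{1}{\beta - 1}.

% At beta = 1 the integral becomes a logarithm instead:
r(k) \simeq N C \ln\frac{k_{\max}}{k}
\;\Longrightarrow\;
Z(r) \simeq k_{\max}\, e^{-r/(N C)}.
```

In this sketch the power-law form of $Z(r)$ exists only for $\beta > 1$; at $\beta = 1$ the rank depends on $\ln k$, which gives an exponential decay in the Zipf's plot.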

Figure 1 also indicates that the growth of distinct characters cannot be described by the Heaps'
law. Indeed, there are two distinguishable stages: in the early stage, $N_t$ grows approximately
linearly with the text length $t$, and in the later stage, $N_t$ grows logarithmically with $t$.
Figure 2 presents the growth of distinct characters for a large collection of 57755 Chinese books
consisting of about $3.4 \times 10^9$ characters and 12800 distinct characters. In addition to the
two stages observed in figure 1, $N_t$ displays a strongly saturated behavior when the text length
$t$ is much larger than the total number of distinct characters in the vocabulary. In summary, the
experiments on Chinese, Japanese and Korean literature reveal some unreported phenomena: the
character frequency obeys a power law with exponent close to 1 yet it decays exponentially in the
Zipf's plot, and the number of distinct characters grows in three distinguishable stages. We next
propose a theoretical model to explain these observations.
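The three empirical quantities discussed above, namely the growth curve $N_t$, the Zipf's plot $Z(r)$ and the frequency distribution $p(k)$, can be measured from any character sequence in a single pass. The sketch below is illustrative only: the function name `text_statistics` is hypothetical, and a short toy string stands in for a real book.

```python
from collections import Counter


def text_statistics(text):
    """Return (growth, zipf, p) for a sequence of characters.

    growth[i] is the number of distinct characters among the first i + 1
    characters (N_t), zipf[r - 1] is the frequency of the rank-r character
    (Z(r)), and p[k] is the fraction of distinct characters that appear
    exactly k times (p(k)).
    """
    counts = Counter()
    growth = []
    for ch in text:
        counts[ch] += 1
        growth.append(len(counts))
    zipf = sorted(counts.values(), reverse=True)
    n_distinct = len(counts)
    freq_of_freq = Counter(counts.values())
    p = {k: n / n_distinct for k, n in sorted(freq_of_freq.items())}
    return growth, zipf, p


if __name__ == "__main__":
    growth, zipf, p = text_statistics("abracadabra")
    print(growth)
    print(zipf)
    print(p)
```

For a real corpus, `growth` would be plotted in log-log and linear-log scales and `zipf` in log-linear scale, following the layout of figure 1.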

FIG. 3: (Color online) Growth of the number of distinct characters versus time for different $V$
and $\epsilon$ according to Eq. (3). Plots (a) and (c) are in log-log scale while (b) and (d) are
their corresponding plots in linear-log scale.

Consider a vocabulary with a finite number, $V$, of distinct characters or words. At each time
step, one character is selected from the vocabulary to form the text. Motivated by the
rich-get-richer mechanism of the Yule-Simon model, at time step $t$, if the character $i$ has been
used $k_i$ times, it will be selected with the probability

$$f(k_i) = \frac{k_i + \epsilon}{\sum_{j=1}^{V} (k_j + \epsilon)} = \frac{k_i + \epsilon}{V\epsilon + t}, \tag{1}$$

where $\epsilon$ is the initial attractiveness of each character. Assume that at time $t$ there are
$N_t$ distinct characters in the text; we first investigate the dependence of $N_t$ on the text
length $t$. The selection at time $t + 1$ can be equivalently divided into two complementary and
mutually exclusive actions: (i) to select a character from the original vocabulary with probability
proportional to $\epsilon$, or (ii) to select a character from the $N_t$ characters in the created
text with probability proportional to its frequency of appearance. Therefore the probability of
choosing a character from the original vocabulary is $\frac{V\epsilon}{V\epsilon + t}$, whereas the
probability of choosing from the created text is $\frac{t}{V\epsilon + t}$. A character chosen from
the created text is always old, while a character chosen from the vocabulary could be new with
probability $1 - \frac{N_t}{V}$. Accordingly, the probability that a new character appears at time
step $t + 1$, namely the growth rate of $N_t$, is

$$\frac{dN_t}{dt} = \frac{V\epsilon}{V\epsilon + t} \left( 1 - \frac{N_t}{V} \right). \tag{2}$$



With the boundary conditions $N_0 = 0$ and $N_\infty = V$, we derive the solution of Eq. (2) as

$$N_t = V \left[ 1 - \left( \frac{V\epsilon}{V\epsilon + t} \right)^{\epsilon} \right]. \tag{3}$$
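As a quick consistency check, one can verify numerically that the closed form of Eq. (3) satisfies the growth equation, Eq. (2). The sketch below compares a central finite difference of Eq. (3) with the right-hand side of Eq. (2); the function names, the step size and the parameter values $V = 1000$, $\epsilon = 0.01$ are choices made here for illustration.

```python
import math


def distinct_count(t, V, eps):
    """Eq. (3): number of distinct characters at text length t."""
    return V * (1.0 - (V * eps / (V * eps + t)) ** eps)


def growth_rate(t, V, eps):
    """Right-hand side of Eq. (2), evaluated on the solution of Eq. (3)."""
    return V * eps / (V * eps + t) * (1.0 - distinct_count(t, V, eps) / V)


def numerical_derivative(t, V, eps, rel_step=1e-3):
    """Central finite difference of Eq. (3) with respect to t (t >= 1)."""
    h = rel_step * max(t, 1.0)
    return (distinct_count(t + h, V, eps) - distinct_count(t - h, V, eps)) / (2 * h)


if __name__ == "__main__":
    V, eps = 1000, 0.01
    for t in (1.0, 1e2, 1e4, 1e6):
        lhs = numerical_derivative(t, V, eps)
        rhs = growth_rate(t, V, eps)
        print(f"t={t:>9.0f}  dN/dt (numeric)={lhs:.6e}  Eq.(2) rhs={rhs:.6e}")
```

The two columns should agree up to the finite-difference error, and `distinct_count(0, V, eps)` returns 0, matching the boundary condition $N_0 = 0$.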



This solution embodies three stages of growth of $N_t$. (i) In the very early stage, when $t$ is
much smaller than $V\epsilon$,
$\left( \frac{V\epsilon}{V\epsilon + t} \right)^{\epsilon} \approx 1 - \frac{t}{V}$ and thus
$N_t \approx t$, corresponding to a short period of linear growth. (ii) When $t$ is of the same
order as $V\epsilon$, if $\epsilon$ is very small, $N_t$ could be much smaller than $V$. Then
Eq. (2) can be approximated as

$$\frac{dN_t}{dt} \approx \frac{V\epsilon}{V\epsilon + t}, \tag{4}$$

leading to a logarithmic solution

$$N_t \approx V\epsilon \ln \left( 1 + \frac{t}{V\epsilon} \right). \tag{5}$$

Indeed, expanding $\left( \frac{V\epsilon}{V\epsilon + t} \right)^{\epsilon}$ in a Taylor series as

$$\left( \frac{V\epsilon}{V\epsilon + t} \right)^{\epsilon} = \sum_{m=0}^{\infty} \frac{1}{m!} \left( \epsilon \ln \frac{V\epsilon}{V\epsilon + t} \right)^{m} \tag{6}$$

and neglecting the high-order terms ($m \geq 2$) under the condition $\epsilon \ll 1$, one also
arrives at the solution Eq. (5). (iii) When $t$ gets larger and larger, $N_t$ approaches $V$ and
thus both $\frac{V\epsilon}{V\epsilon + t}$ and $1 - \frac{N_t}{V}$ are very small, leading to a
very slow growth of $N_t$ according to Eq. (2). These three stages predicted by the analytical
solution are in good accordance with the above empirical observations.

Figure 3 reports the numerical results of Eq. (3). In accordance with the analysis, when $t$ is
small, $N_t$ grows linearly as shown in Fig. 3(a) and 3(c), and in Fig. 3(b) and 3(d), straight
lines appear in the middle region, indicating the logarithmic growth predicted by Eq. (5).

FIG. 4: (Color online) Comparison between simulation results (blue data points) and analytical
solutions (red curves) for typical parameters $V = 1000$ and $\epsilon = 0.01$. The subgraphs (a)
and (c) are plotted in log-log scale, while (b) and (d) show the same data points as (a) and (c)
displayed in log-linear and linear-log scales, respectively. The results are obtained by averaging
over 100 independent runs with text length equal to $10^6$.

Denote by $n(t, k)$ the number of distinct characters that have appeared $k$ times until time $t$;
then $n(t, k) = N_t p(k)$. According to the master equation, we have

$$n(t + 1, k + 1) = n(t, k + 1) \left[ 1 - f(k + 1) \right] + n(t, k) f(k). \tag{7}$$

Substituting Eq. (1) into Eq. (7), we obtain

$$N_{t+1} p(k + 1) = N_t p(k + 1) \left[ 1 - \frac{k + 1 + \epsilon}{V\epsilon + t} \right] + N_t p(k) \frac{k + \epsilon}{V\epsilon + t}. \tag{8}$$

Via the continuous approximation, it turns into the following differential equation

$$\frac{dp}{p} = -\left[ 1 + \frac{V\epsilon + t}{N_t} \left( N_{t+1} - N_t \right) \right] \frac{dk}{k + \epsilon}. \tag{9}$$

Substituting $N_{t+1} - N_t = dN_t/dt$ and Eq. (2), we get the solution

$$p(k) = B (k + \epsilon)^{-\left[ 1 + \epsilon \left( \frac{V}{N_t} - 1 \right) \right]}, \tag{10}$$

where $B$ is the normalization factor. The result shows that the character frequency follows a
power-law distribution with an exponent changing in time. Considering the finite vocabulary size,
in the limit of large $t$, $N_t \to V$ and thus the power-law exponent,
$\beta = 1 + \epsilon \left( \frac{V}{N_t} - 1 \right)$, approaches 1. Under the continuous
approximation, the cumulative distribution of character frequency can be written as

$$P(k > k_0) = 1 - \int_{k_{\min}}^{k_0} p(k)\,dk = 1 - B \, \frac{(k_0 + \epsilon)^{1-\beta} - (k_{\min} + \epsilon)^{1-\beta}}{1 - \beta}, \tag{11}$$

where $k_{\min}$ is the smallest frequency. When $\beta \to 1$,
$k^{1-\beta} \approx 1 + (1 - \beta) \ln k$, and thus



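$$P(k > k_0) = 1 - B \ln \frac{k_0 + \epsilon}{k_{\min} + \epsilon}, \tag{12}$$

where $B = \left[ \ln \frac{k_{\max} + \epsilon}{k_{\min} + \epsilon} \right]^{-1}$ according to the
normalization condition $\int_{k_{\min}}^{k_{\max}} p(k)\,dk = 1$ and $k_{\max}$ is the highest
frequency. According to Eq. (12), there are
$\left[ 1 - B \ln \frac{k + \epsilon}{k_{\min} + \epsilon} \right] N_t$ characters having appeared
more than $k$ times. That is to say, a character having appeared $k$ times will be ranked at
$r = \left[ 1 - B \ln \frac{k + \epsilon}{k_{\min} + \epsilon} \right] N_t + 1$. Therefore

$$Z(r) = (k_{\max} + \epsilon) \exp \left[ -\frac{r - 1}{B N_t} \right] - \epsilon, \tag{13}$$

and $Z(1) = k_{\max}$, $Z(N_t) \approx k_{\min}$. In a word, this simple model accounting for the
finite vocabulary size results in a power-law character frequency distribution
$p(k) \sim k^{-\beta}$ with exponent $\beta$ close to 1 and an exponential decay of $Z(r)$ in the
Zipf's plot, which agree perfectly with the empirical observations on Chinese, Japanese and Korean
books.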
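The exponential Zipf curve of Eq. (13) can be evaluated directly. The sketch below assumes the form $Z(r) = (k_{\max} + \epsilon) \exp[-(r - 1)/(B N_t)] - \epsilon$ with $B = 1/\ln[(k_{\max} + \epsilon)/(k_{\min} + \epsilon)]$, that is, the rank convention in which the most frequent character has $r = 1$. The function name and the parameter values are illustrative assumptions, not data from the four books.

```python
import math


def zipf_frequency(rank, n_distinct, k_max, k_min, eps):
    """Frequency of the character at a given rank according to Eq. (13).

    Assumes rank 1 is the most frequent character and that the normalization
    factor is B = 1 / ln((k_max + eps) / (k_min + eps)).
    """
    B = 1.0 / math.log((k_max + eps) / (k_min + eps))
    return (k_max + eps) * math.exp(-(rank - 1) / (B * n_distinct)) - eps


if __name__ == "__main__":
    # Hypothetical parameters chosen only to illustrate the shape of the curve.
    N, k_max, k_min, eps = 4000, 20000, 1, 0.01
    for r in (1, 1000, 2000, 3000, 4000):
        print(r, round(zipf_frequency(r, N, k_max, k_min, eps), 3))
```

Since $\ln[Z(r) + \epsilon]$ is linear in $r$ under this form, the curve appears as a straight line in a log-linear Zipf's plot.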

Figure 4 reports the simulation results for typical parameters. The power-law frequency
distribution, the exponential decay of frequency in the Zipf's plot and the linear-to-logarithmic
transition in the growth of the number of distinct characters are all clearly observed in the
simulation. The simulation results agree very well with the analytical solutions presented in
Eq. (3), Eq. (10) and Eq. (13).
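A simulation of the finite-vocabulary model can be sketched as follows. This is not the authors' code: it is a minimal version that uses the decomposition described in the text (draw uniformly from the vocabulary with probability $V\epsilon/(V\epsilon + t)$, otherwise copy a uniformly random earlier character), and the function names and the small parameter values are assumptions chosen so that it runs quickly.

```python
import random


def analytic_distinct(t, V, eps):
    """Eq. (3): predicted number of distinct characters at text length t."""
    return V * (1.0 - (V * eps / (V * eps + t)) ** eps)


def simulate_growth(V, eps, length, seed=0):
    """Simulate the model and return (text, growth).

    Character i is chosen with probability (k_i + eps) / (V * eps + t),
    implemented as a mixture: a uniform draw from the V-character vocabulary
    with probability V * eps / (V * eps + t), otherwise a copy of a uniformly
    random earlier character (i.e. proportional to k_i).
    growth[j] is the number of distinct characters after j + 1 steps.
    """
    rng = random.Random(seed)
    text, seen, growth = [], set(), []
    for t in range(length):
        if rng.random() < V * eps / (V * eps + t):
            ch = rng.randrange(V)       # at t = 0 this branch is always taken
        else:
            ch = rng.choice(text)
        text.append(ch)
        seen.add(ch)
        growth.append(len(seen))
    return text, growth


if __name__ == "__main__":
    V, eps, length = 200, 0.5, 5000
    runs = [simulate_growth(V, eps, length, seed=s)[1][-1] for s in range(10)]
    print("simulated N_t (mean of 10 runs):", sum(runs) / len(runs))
    print("Eq. (3) prediction:", analytic_distinct(length, V, eps))
```

Averaging over several seeds, as in figure 4, reduces the run-to-run fluctuations; the parameters of figure 4 ($V = 1000$, $\epsilon = 0.01$, text length $10^6$) can be passed to the same function at a higher running cost.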

Previous statistical analyses of human language overwhelmingly concentrate on the Indo-European
family, where each language consists of a huge number of words. In contrast, languages consisting
of characters, though used by more than a billion people, have received less attention. These
languages include Chinese, Japanese, Korean, Vietnamese, the Jurchen language, the Khitan language,
the Makhi language, the Tangut language, and many others. The empirical studies here show that the
scaling laws of character-formed languages differ remarkably from those of word-formed languages.
Salient features include an exponential decay of character frequency in the Zipf's plot associated
with a power-law frequency distribution with exponent close to 1, and a multi-stage growth of the
number of distinct characters. These findings not only complement our understanding of scaling laws
in human language, but also refine the knowledge about the relationship between the power law and
the Zipf's law, as well as the applicability of the Heaps' law. As a result, we should be careful
when applying the Zipf's plot to a power-law distribution with exponent around 1, such as the
cluster size distribution in two-dimensional self-organized critical systems [37], the inter-event
time distribution in human activities [38], the family name distribution in Korea [39], the species
lifetime distribution [40], and so on. Meanwhile, we cannot rule out a possible power-law
distribution just from a non-power-law decay in the Zipf's plot [H].

The currently reported scaling laws can be reproduced by considering a finite vocabulary size in a
rich-get-richer process. Different from the well-known finite-size effects that vanish in the
thermodynamic limit, the effects caused by the finite vocabulary size get stronger as the system
size increases. Finite choices must be a general feature of selecting dynamics, but not a necessary
ingredient of growing systems. For example, although also based on the rich-get-richer mechanism,
neither the linear growing model [41] nor the accelerated growing model [42] (treating the total
degree as the text length and nodes as distinct characters, the accelerated networks grow in the
Heaps' manner [32]) has considered such an ingredient. The present model could distinguish the
selecting dynamics from the general dynamics of growing systems.

This work is partially supported by the Swiss National Science Foundation (Project 200020-132253)
and the Fundamental Research Funds for the Central Universities.